Load all required packages for this report and define some global customization settings.
# for knitting the .rmd file to .html
if (! requireNamespace("knitr", quietly = TRUE)) {
install.packages("knitr")
}
if (! requireNamespace("kableExtra", quietly = TRUE)) {
install.packages("kableExtra")
}
# for creating models and analyzing expression data
if (! requireNamespace("edgeR", quietly = TRUE)) {
BiocManager::install("edgeR")
}
if (! requireNamespace("limma", quietly = TRUE)) {
BiocManager::install("limma")
}
if (! requireNamespace("dplyr", quietly = TRUE)) {
install.packages("dplyr")
}
if (! requireNamespace("gprofiler2", quietly = TRUE)) {
install.packages("gprofiler2")
}
# for creating plots
if (! requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
if (! requireNamespace("ggrepel", quietly = TRUE)) {
install.packages("ggrepel")
}
if (! requireNamespace("ComplexHeatmap", quietly = TRUE)) {
BiocManager::install("ComplexHeatmap")
}
if (! requireNamespace("circlize", quietly = TRUE)) {
BiocManager::install("circlize")
}
library(edgeR)
library(dplyr)
library(ggplot2)
# wraps lines that are too long
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=80), tidy=TRUE)
# set default behaviors for all chunks
knitr::opts_chunk$set(warning = FALSE, message = FALSE)All table and figure outputs in this report are rendered using the knitr package [1], and kableExtra package [2] for styling. Some codes in this report are adapted from Lecture 6 - Differential Expression [3] and Lecture 7 - Annotation Resources and Prelim ORA [4].
Glucocorticoids (GC) is a class of steroid hormones that plays a role in regulating the immune system and certain aspects of the immune function, such as inflammation. The most common naturally-produced GC hormone in humans is cortisol, which are produced by the adrenal glands. Due to GC’s potent anti-inflammatory effects, several synthetic forms of GC are used as immunosuppressive drugs to treat various medical conditions such as asthma, allergies, autoimmune disorders, and cancer. [5]
The data set used in this report comes from Quatrini et al. (2022)’s paper: Glucocorticoids inhibit human hematopoietic stem cell differentiation toward a common ILC precursor [5], where they analyzed the the role of GC on regulating innate lymphoid cells (ILCs) differentiation, including cytotoxic natural killer cells and helper ILCs, from human hematopoietic stem cells (HSCs). The RNA-seq data set used in this report is a part of the overall study; it evaluates the effect of the presence of Dexamethasone (DEX), a glucocorticoid medication, on gene expressions of HSCs. The raw data set is downloaded from the NCBI Gene Expression Omnibus, with ID GSE186950. [6]
In Assignment 1, I performed initial pre-processing and normalization of the data set. The raw data set contains 2 groups, control and conditioned (DEX), with 3 samples in each group: AF25, AF26, AF29. The data set initially contained gene expressions of 58683 genes, but 46243 genes were removed due to low counts with less than 1 count per million in at least 3 samples, leaving 12440 genes.
In the original data set, genes are labeled in a mix of HUGO gene symbols, GenBank accession IDs, EMBL identifiers, etc. Using the biomaRt package [7], we attempted to map non-HUGO identifiers to their corresponding HUGO gene symbols, but had to discard 847 (6.8%) genes with no matches, leaving a final set of 11593 unique genes.
knitr::include_graphics("figures/A2/raw_fltr_data_cpm.png")Figure 1. log2 CPM Distributions of gene expressions of each sample in the data set, after filtering out genes with low-counts or without HUGO gene symbol match, and before TMM normalization. This figure is adapted from Assignment 1 with slight aesthetic modifications.
As shown in Fig.1, The filtered data set is approximately normally distributed with no outlier samples, and therefore we normalized the data using the Trimmed Mean of M-values (TMM) method, via the edgeR package. [8]
Since we’ll be using functions from the edgeR package to build the
differential expression models in this report, the DGEList object is
needed. Therefore, we’ll load the filtered data set (in counts), which
excludes the removed genes as outlined above, via an .rds file, and then
repeat the TMM-normalization process. The
raw_fltr_matrix.rds file can be generated by running
saveRDS(raw_fltr_matrix, 'raw_fltr_matrix.rds') after
running all code chunks in a1.Rmd.
raw_fltr_matrix <- readRDS("raw_fltr_matrix.rds")
# group each sample into either the control (CTRL) or the conditioned (DEX)
# group.
groups <- data.frame(matrix(NA, nrow = 6, ncol = 2))
for (i in colnames(raw_fltr_matrix)) {
idx <- match(i, colnames(raw_fltr_matrix))
if (grepl("DEX", i)) {
groups[idx, 1] <- "DEX"
groups[idx, 2] <- substring(i, 1, 4)
} else {
groups[idx, 1] <- "CTRL"
groups[idx, 2] <- substring(i, 1, 4)
}
}
rownames(groups) <- colnames(raw_fltr_matrix)
colnames(groups) <- c("treatment", "sample")
kableExtra::kable_paper(knitr::kable(groups, format = "html"))| treatment | sample | |
|---|---|---|
| AF25 | CTRL | AF25 |
| AF26 | CTRL | AF26 |
| AF29 | CTRL | AF29 |
| AF25DEX | DEX | AF25 |
| AF26DEX | DEX | AF26 |
| AF29DEX | DEX | AF29 |
Table 1. An Overview of the groups in the data set. The data set contains two factors: 1) Treatment, where each sample is catogorized into either the control (CTRL) or the conditioned (DEX); 2) Sample, where each sample belongs to one of the three biological entities: AF25, AF26, and AF29.
# Normalization via the TMM method
dge <- edgeR::DGEList(counts = raw_fltr_matrix, group = groups$treatment)
# calculate normalization factor
dge <- calcNormFactors(dge)
# get the normalized data
normalized_cpm <- cpm(dge)
# A preview of the normalized data.
kableExtra::kable_paper(knitr::kable(data.frame(head(normalized_cpm)), format = "html"))| AF25 | AF26 | AF29 | AF25DEX | AF26DEX | AF29DEX | |
|---|---|---|---|---|---|---|
| A1BG-AS1 | 8.844446 | 6.115007 | 7.757708 | 6.53587 | 6.690002 | 5.139955 |
| A2M | 82.753247 | 86.173319 | 117.025851 | 69.00011 | 61.464389 | 45.698876 |
| AAAS | 49.990344 | 35.644053 | 41.346933 | 43.04337 | 52.579231 | 44.203616 |
| AACS | 42.991696 | 31.138258 | 28.720025 | 47.99196 | 42.021572 | 35.699326 |
| AAGAB | 47.836914 | 42.965969 | 34.001869 | 55.74163 | 50.279542 | 46.166145 |
| AAK1 | 51.220876 | 52.782164 | 50.920275 | 55.64826 | 44.739385 | 48.689395 |
Table 2. A preview of the normalized data set, in counts per million (CPM). Each row represents a unique gene as identified by their HUGO gene symbol in the row name, and each column represents a sample in the data set.
To start, we first create a heat map of gene expressions across all genes and all samples in the data set. This allows us to get an rough idea of the expression change patterns at-large. All heatmaps in this report are generated using the ComplexHeatmap package [9] and the circlize package [10] for generating interpolated colours.
# Create a row-normalized heatmap matrix
heatmap_matrix <- t(scale(t(normalized_cpm)))The range of values in our heatmap matrix is [-2.0352289,
2.041092].
The two ends are approximately the same in magnitude, and so we use 0 as
the middle point and set the color scheme to be “blue - white - red”
where blue corresponds to lower gene expressions and red corresponds to
higher gene expressions.
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix), 0, max(heatmap_matrix)),
c("blue", "white", "red"))
# Create the heatmap
gene_express_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix), show_row_dend = TRUE,
show_column_dend = TRUE, col = heatmap_col, show_column_names = TRUE, show_row_names = FALSE,
show_heatmap_legend = TRUE, name = "row-normalized gene \nexpression (CPM)",
km = 3, column_title = "Row-Normalized Differential Gene Expressions (in CPM) In Each Sample")
gene_express_heatmapFigure 2. Heatmap of row-normalized gene expressions, in counts-per-million (CPM), in each sample of the data set. Column-wise, each sample is hierarchically clustered based on the similarity of their gene expressions. Row-wise, each gene is hierarchically clustered based on their expressions across the samples. A k-mean clustering of size 3 is applied to the rows to cluster genes into 3 groups based on similarity in expressions.
In Fig.2, column-wise dendrogram shows that the six samples were cleanly clustered into two groups, control (CTRL) and conditioned (DEX). Row-wise, although the clustering pattern between individual genes are too difficult to identify, the three main groups as clustered by k-means do express some distinct differences between each other, andto an extent, between the control and conditioned group in each cluster.
An interesting note is that between the three biological entities (AF25, AF26, AF29), AF29 exhibits some unique expression patterns that differs from AF25 and AF26, in both the control and conditioned samples. The original paper where the data set came from did not have mentioning nor explanations to this difference, so it is likely that the difference is purely based on biological variability between the samples. Nevertheless, to verify, we built a multidimensional scaling (MDS) plot using the limma package [11] to illustrate the distances between samples.
limma::plotMDS(heatmap_matrix, col = c("blue", "red")[factor(groups$treatment)],
main = "Distance Between Samples")Figure 3. Multidimensional scaling (MDS) plot of samples in the data set. Distances on the plot are approximations of the typical log2 fold changes between the samples. Samples in the control (CTRL) group are colored blue and samples in the conditioned (DEX) group are colored red.
The MDS plot (Fig.3) supports the clustering patterns found in the heat map, both between the groups and between samples in each group. Namely, samples in the control group are clustered away from the conditioned group, and with in each of the two main clusters, AF29 are positioned farther away from AF25 and AF26. This aligns with the unqiue gene expression profiles of AF29 that observed in the heat map.
Without accounting for the biological variabilities between samples, our data set is quite simple with only two groups: control and conditioned. Therefore, we’ll first use the Exact Test model from the edgeR package [8] to test for the differential gene expressions between the two groups.
model_cond <- model.matrix(~groups$treatment + 0)
colnames(model_cond) <- c("CTRL", "DEX")
rownames(model_cond) <- colnames(normalized_cpm)
kableExtra::kable_paper(knitr::kable(model_cond, format = "html"))| CTRL | DEX | |
|---|---|---|
| AF25 | 1 | 0 |
| AF26 | 1 | 0 |
| AF29 | 1 | 0 |
| AF25DEX | 0 | 1 |
| AF26DEX | 0 | 1 |
| AF29DEX | 0 | 1 |
Table 3. Design of Model #1. In this model, only the treatment condition is considered and each sample belongs to either CTRL or DEX (conditioned).
dge_cond <- edgeR::estimateDisp(dge, model_cond)
et <- edgeR::exactTest(dge_cond, pair = c("CTRL", "DEX"))
top_et <- topTags(et, sort.by = "PValue", n = nrow(normalized_cpm))
kableExtra::kable_paper(knitr::kable(top_et$table[1:10, ], format = "html", digits = 150))| logFC | logCPM | PValue | FDR | |
|---|---|---|---|---|
| DAAM2 | 12.219507 | 7.260689 | 2.995010e-129 | 3.472115e-125 |
| ADAMTS2 | 8.592964 | 6.707461 | 1.025497e-77 | 5.944291e-74 |
| LPL | 4.320328 | 6.105950 | 1.115675e-55 | 4.311341e-52 |
| IL1R2 | 7.398576 | 7.941220 | 4.001718e-55 | 1.159798e-51 |
| TIFAB | -4.210233 | 5.021435 | 2.691740e-53 | 6.241069e-50 |
| MRC2 | -5.582371 | 5.082934 | 3.562996e-53 | 6.884302e-50 |
| AIG1 | 3.172247 | 6.159028 | 2.335066e-50 | 3.859080e-47 |
| FCER1A | -4.956970 | 6.061634 | 2.663041e-50 | 3.859080e-47 |
| MAOA | 7.441233 | 5.763348 | 2.451648e-47 | 3.157995e-44 |
| TPST1 | 4.763426 | 5.550884 | 2.007750e-46 | 2.327585e-43 |
Table 4. A preview of the top 10 differentially-expressed genes (in order of ascending p-value) under the treatment-only exact test model.
To visualize the set of differentially expressed genes, we’ll use a
Volcano plot. A portion of the following code is adapted from this
tutorial [12] in terms of
customization and gene label placements. The %>%
shortcut from the dplyr package [13] was
used for data manipulation, and the ggplot2 package [14] and the ggrepel package [15] were used to generate the Volcano
plots.
top_et_label <- top_et$table %>%
mutate(Expression = case_when(logFC > 0 & FDR < 0.05 ~ "Up-regulated", logFC <
0 & FDR < 0.05 ~ "Down-regulated", TRUE ~ "Unchanged"))
# label the top five genes in either direction of change
top <- 5
top_genes_et <- bind_rows(top_et_label %>%
filter(Expression == "Up-regulated") %>%
arrange(FDR, desc(abs(logFC))) %>%
head(top), top_et_label %>%
filter(Expression == "Down-regulated") %>%
arrange(FDR, desc(abs(logFC))) %>%
head(top))
volplot_et <- ggplot2::ggplot(top_et_label, aes(logFC, -log(FDR, 10))) + geom_point(aes(color = Expression),
size = 0.6) + xlab(expression("log"[2] * "FC")) + ylab(expression("-log"[10] *
"FDR")) + scale_color_manual(values = c("dodgerblue3", "gray50", "firebrick3")) +
guides(colour = guide_legend(override.aes = list(size = 4)))
volplot_et <- volplot_et + ggrepel::geom_label_repel(data = top_genes_et, mapping = aes(logFC,
-log(FDR, 10), label = rownames(top_genes_et)), size = 3.5) + ggtitle("Volcano Plot of Differentially Expressed Genes Distribution in Model #1")
volplot_etFigure 4. Volcano plot of differentially expressed genes in Model #1, accounting for only the treatment variability using the Exact Test Model from edgeR. Differentially expressed genes are coloured based on direction of change with blue indicating down-regulated genes and red indicating up-regulated genes (FDR-corrected p<0.05). Genes with the top five log fold change (in magnitude) in either directions are labelled.
top_et_genes <- top_et[top_et$table$FDR < 0.05, ]
top_et_genes <- rownames(top_et_genes$table)
heatmap_matrix_tophits_et <- heatmap_matrix[which(rownames(heatmap_matrix) %in% top_et_genes),
]
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_tophits_et), 0, max(heatmap_matrix_tophits_et)),
c("blue", "white", "red"))
# Set colours
treat_colours <- c("orange", "darkblue")
names(treat_colours) <- c("DEX", "CTRL")
# Set annotations
treatment_col <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(treatment = rep(c("CTRL",
"DEX"), c(3, 3))), col = list(treatment = treat_colours))
model_1_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits_et),
cluster_rows = TRUE, cluster_columns = FALSE, show_row_dend = TRUE, show_column_dend = FALSE,
col = heatmap_col, show_column_names = TRUE, show_row_names = FALSE, show_heatmap_legend = TRUE,
top_annotation = treatment_col, column_title = "Annotated Heatmap of Model #1",
name = "row-normalized gene \nexpression (CPM)", )
model_1_heatmapFigure 5. Annotated heatmap of differentially expressed genes in Model #1, comparing conditioned to control samples. Samples are annotated based on treatment conditions.
Recall that in the global heatmap, we observed some differences in terms of gene expressions between the three biological entities. Although the paper did not mention this difference, we’ll build another model to see how sample variability affect out results. We use the Quasi-Likelihood Model from the edgeR package [8], and include both the treatment and sample conditions.
model_samp <- model.matrix(~groups$sample + groups$treatment)
rownames(model_samp) <- colnames(normalized_cpm)
kableExtra::kable_paper(knitr::kable(model_samp, format = "simple"))| (Intercept) | groups$sampleAF26 | groups$sampleAF29 | groups$treatmentDEX | |
|---|---|---|---|---|
| AF25 | 1 | 0 | 0 | 0 |
| AF26 | 1 | 1 | 0 | 0 |
| AF29 | 1 | 0 | 1 | 0 |
| AF25DEX | 1 | 0 | 0 | 1 |
| AF26DEX | 1 | 1 | 0 | 1 |
| AF29DEX | 1 | 0 | 1 | 1 |
Table 5. Design of Model #2. In this model, both the treatment and sample condition are considered and each sample belongs to one of (CTRL, DEX) as well as one of (AF25, AF26, AF29).
# Estimate dispersion
dge_samp <- edgeR::estimateDisp(dge, model_samp)
# Fit the Quasi Likelihood Model
fit <- edgeR::glmQLFit(dge_samp, model_samp)
qlf <- edgeR::glmQLFTest(fit, coef = "groups$treatmentDEX")
top_qlf <- topTags(qlf, sort.by = "PValue", n = nrow(normalized_cpm))
kableExtra::kable_paper(knitr::kable(top_qlf$table[1:10, ], format = "html", digits = 20))| logFC | logCPM | F | PValue | FDR | |
|---|---|---|---|---|---|
| DAAM2 | 12.214339 | 7.258871 | 1107.3521 | 3.2000e-19 | 3.690360e-15 |
| ADAMTS2 | 8.644263 | 6.704992 | 881.9582 | 3.0800e-18 | 1.788146e-14 |
| IL1R2 | 7.360969 | 7.940311 | 841.6158 | 4.9200e-18 | 1.899649e-14 |
| HLA-DQA1 | -7.099430 | 6.611630 | 679.4327 | 4.1110e-17 | 1.191407e-13 |
| MAOA | 7.404167 | 5.759322 | 661.2241 | 5.3780e-17 | 1.246887e-13 |
| MFGE8 | 4.399029 | 7.609859 | 619.0729 | 1.0307e-16 | 1.991496e-13 |
| HLA-DPB1 | -5.698476 | 7.211961 | 545.8348 | 3.5611e-16 | 5.897655e-13 |
| TMIGD3 | 5.491745 | 5.200044 | 535.7945 | 4.2736e-16 | 6.078592e-13 |
| CYP4F3 | 5.637621 | 5.289852 | 530.4112 | 4.7190e-16 | 6.078592e-13 |
| CD74 | -3.704543 | 11.448976 | 519.9534 | 5.7374e-16 | 6.651424e-13 |
Table 6. A preview of the top 10 differentially-expressed genes (in order of ascending p-value) under the treatment+sample QLM model.
top_qlf_label <- top_qlf$table %>%
mutate(Expression = case_when(logFC > 0 & FDR < 0.05 ~ "Up-regulated", logFC <
0 & FDR < 0.05 ~ "Down-regulated", TRUE ~ "Unchanged"))
# label the top five genes in either direction of change
top <- 5
top_genes_qlf <- bind_rows(top_qlf_label %>%
filter(Expression == "Up-regulated") %>%
arrange(FDR, desc(abs(logFC))) %>%
head(top), top_qlf_label %>%
filter(Expression == "Down-regulated") %>%
arrange(FDR, desc(abs(logFC))) %>%
head(top))
volplot_qlf <- ggplot2::ggplot(top_qlf_label, aes(logFC, -log(FDR, 10))) + geom_point(aes(color = Expression),
size = 0.6) + xlab(expression("log"[2] * "FC")) + ylab(expression("-log"[10] *
"FDR")) + scale_color_manual(values = c("dodgerblue3", "gray50", "firebrick3")) +
guides(colour = guide_legend(override.aes = list(size = 4)))
volplot_qlf <- volplot_qlf + ggrepel::geom_label_repel(data = top_genes_qlf, mapping = aes(logFC,
-log(FDR, 10), label = rownames(top_genes_qlf)), size = 3.5, max.overlaps = Inf) +
ggtitle("Volcano Plot of Differentially Expressed Genes Distribution in Model #2")
volplot_qlfFigure 6. Volcano plot of differentially expressed genes in Model #2, accounting for both the treatment and sample variability using the Quasi-Liklihood Model from edgeR. Differentially expressed genes are coloured based on direction of change with blue indicating down-regulated genes and red indicating up-regulated genes (FDR-corrected p<0.05). Genes with the top five log fold change (in magnitude) in either directions are labelled.
# Subset the threshold gene
top_qlf_genes <- top_qlf[top_qlf$table$FDR < 0.05, ]
top_qlf_genes <- rownames(top_qlf_genes$table)
heatmap_matrix_tophits_qlf <- heatmap_matrix[which(rownames(heatmap_matrix) %in%
top_qlf_genes), ]
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_tophits_qlf), 0, max(heatmap_matrix_tophits_qlf)),
c("blue", "white", "red"))
# Set colours
treat_colours <- c("orange", "darkblue")
names(treat_colours) <- c("DEX", "CTRL")
samp_colors <- hcl.colors(3, palette = "set 2")
names(samp_colors) <- c("AF25", "AF26", "AF29")
# Set annotations
annot_col <- ComplexHeatmap::HeatmapAnnotation(df = data.frame(treatment = rep(c("CTRL",
"DEX"), c(3, 3)), sample = c("AF25", "AF26", "AF29", "AF25", "AF26", "AF29")),
col = list(treatment = treat_colours, sample = samp_colors))
model_2_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_tophits_qlf),
cluster_rows = TRUE, cluster_columns = FALSE, show_row_dend = TRUE, show_column_dend = FALSE,
col = heatmap_col, show_column_names = TRUE, show_row_names = FALSE, show_heatmap_legend = TRUE,
top_annotation = annot_col, column_title = "Annotated Heatmap of Model #2", name = "row-normalized gene \nexpression (CPM)",
)
model_2_heatmapFigure 7. Annotated heatmap of differentially expressed genes in Model #2, comparing conditioned to control samples as well as between the three biological entities. Samples are annotated based on treatment and sample conditions.
model_1 <- data.frame(rownames(top_et$table), top_et$table$FDR)
model_2 <- data.frame(rownames(top_qlf$table), top_qlf$table$FDR)
# Merge the two model
merged_model <- merge(model_1, model_2, by.x = 1, by.y = 1)
colnames(merged_model) <- c("gene", "model_1_fdr", "model_2_fdr")
merged_model$colour <- "gray60"
merged_model$colour[merged_model$model_1_fdr < 0.05] <- "orange"
merged_model$colour[merged_model$model_2_fdr < 0.05] <- "dodgerblue3"
merged_model$colour[merged_model$model_1_fdr < 0.05 & merged_model$model_2_fdr <
0.05] <- "firebrick3"
# Output comparison plot
plot(merged_model$model_1_fdr, merged_model$model_2_fdr, col = merged_model$colour,
xlab = "Model #1 p-values (FDR-corrected)", ylab = "Model #2 p-values (FDR-corrected)",
main = "Distribution of FDR-Corrected p-values for Genes in Model #1 and #2")
legend("topleft", legend = c("Model #1", "Model #2", "Both", "Not signif"), fill = c("orange",
"dodgerblue3", "firebrick3", "gray60"))Figure 8. Comparisons between Model #1 and #2 in terms of FDR-corrected p-value distribution.
From Fig.8, it is evident that Model #2 (treatment & sample variability) outputted more genes that are significantly differentially expressed. The final choice of model for over-representation analysis is explained in Model Choice.
1. Calculate p-values for each of the genes in your expression set. How many genes were significantly differentially expressed? What thresholds did you use and why?
I set the p-value threshold to be 0.05, where genes with p<0.05 are considered significantly differentially expressed. This threshold aligns with what the authors of this data set used in their paper [5], and it is also one of the mainstream choices in gene expression analyses that balances between the risk of false positives (picking up biologically-insignificant genes) and and false negatives (omitting biologically-significant genes).
In Model #1 (Treatment-Only Exact Test Model), 2808 genes pass the p-value threshold.
In Model #2 (Treatment + Sample Quasi-Liklihood Model), 4100 genes pass the p-value threshold.
Details of the p-value calculation for each model can be found in Model Design.
2. Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction?
I used Benjamini-Hochberg FDR method for multiple hypothesis
correction, with p-value(FDR)<0.05 indicating a significant
differential expression. This is automatically calculated as a part of
the exactTest and glmQLFTest function [8]. The BH-FDR method is chosen because we wish
to control false positive results that are more likely to occur the more
testings that we do. This method assumes that the probability of
detecting a significant discovery is equally likely among each of the
tests, which is a reasonable assumption for each genes in our model, and
is commonly used for other similar gene expression analyses. [16]
In Model #1 (Treatment-Only Exact Test Model), 1658 genes pass the p-value threshold after FDR correction.
In Model #2 (Treatment + Sample Quasi-Liklihood Model), 2945 genes pass the p-value threshold after FDR correction.
3. Show the amount of differentially expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest.
See Model #1 Volcano Plot and Model #2 Volcano Plot. The publication where my data set comes from did not specify any specific genes of interest, and therefore I highlighted the top 5 genes (with the greatest log fold change value) in both the up-regulated and down-regulated path.
4. Visualize your top hits using a heatmap. Do you conditions cluster together? Explain why or why not.
See Model #1 Heat Map and Model #2 Heat Map. In both models, samples in each treatment conditions (CTRL vs. DEX) do cluster together among themselves, as shown in the clear separation between the two colors in the top annotation bar labeled ‘treatment’. Additionally, in Model #2’s heatmap, as shown in the column dendrogram and in the heatmap matrix, AF25 seems to cluster with AF26, while AF29 is more distinct compared to the rest in both the CTRL and DEX sample. These findings align with the pattern we saw in the global heatmap from Global Overview of Differential Gene Expression, and confirms that the presence of dexamethasone do have some impact on the expressions of certain genes.
In the last section, we identified the set of genes that are significantly up-regulated or down-regulated in each of our two models. Comparing Model #1 and #2, the top hits (with greatest logFC magnitudes) are mostly consisted of the same set of genes in slightly different orders. This is also evident in Fig.8 Comparisons Between the Two Models where the genes with p-value (FDR) < 0.05 in both models clustered in the bottom left corner in red. Although Model #2 returns more genes that are significantly differentially expressed (p-value (FDR) < 0.05, model #1: 1658, model #2: 2945), the p-values of its top hits are generally much larger than that of Model #1. In addition, comparing the heat maps of the two models, the significant set of genes in model #1 show a clearer distinction between the CTRL and DEX clusters than model #2. Lastly, Model #1 already contain a substantial set (1658) of significantly differentially-expressed genes, and adding more genes into our analyses may introduce more noise than biologically-relevant discoveries. Thus, we’ll do the subsequent analyses with Model #1.
In Fig.5 Model #1 Heat Map, it is indicated that the presence of dexamethasone do have some impact on gene expressions of certain genes. In this section, we’ll use g:Profiler [17] to run a thresholded gene set enrichment analysis on the significant genes to see which pathways are these genes involved in. We’ll use the gprofiler2 package [18] to directly access g:Profiler from R.
de_genes <- top_et$table
# extract genes that are up-regulated and down-regulated.
de_genes <- de_genes[which(de_genes$FDR < 0.05 & abs(de_genes$logFC) > log(2)), ]
upreg_genes <- de_genes[which(de_genes$logFC > log(2)), ]
downreg_genes <- de_genes[which(de_genes$logFC < -log(2)), ]# Obtain data source version information
version_info <- gprofiler2::get_version_info(organism = "hsapiens")
version_info <- version_info$sources
data_sources <- c("GO:BP", "KEGG", "REAC", "WP")
version_info <- version_info[data_sources]
ver_info_df <- data.frame(sapply(version_info, `[[`, 2))
colnames(ver_info_df) <- c("Version")
# the output table of version information can be found in section 'Thresholded
# Over-Representation Analysis Questions'gp_de <- gprofiler2::gost(query = rownames(de_genes), organism = "hsapiens", sources = c("GO:BP",
"KEGG", "REAC", "WP"), significant = TRUE, user_threshold = 0.05, correction_method = "fdr",
exclude_iea = TRUE)Interactive Manhattan Plot of g:Profiler Enrichment Analysis of All Significant Differentially Expressed Genes
Figure 9. Interactive Manhattan plot of g:Profiler output of all significant differentially expressed genes, including both up-regulated and down-regulated, thresholded based on p-value (FDR) < 0.05 and abs(logFC)>log(2). The enrichment analysis is done from 4 data sources: GO:BP, KEGG, REAC, WP, as labeled by color.
top_n <- 1:2
gp_de_result <- gp_de$result[order(gp_de$result$p_value, decreasing = FALSE), ]
gp_de_result <- subset(gp_de_result, select = c("term_id", "source", "term_name",
"p_value"))
top_terms_de_1 <- gp_de_result[which(gp_de_result$source == "GO:BP"), ][top_n, ]
top_terms_de_2 <- gp_de_result[which(gp_de_result$source == "KEGG"), ][top_n, ]
top_terms_de_3 <- gp_de_result[which(gp_de_result$source == "REAC"), ][top_n, ]
top_terms_de_4 <- gp_de_result[which(gp_de_result$source == "WP"), ][top_n, ]
top_terms_de <- rbind(top_terms_de_1, top_terms_de_2, top_terms_de_3, top_terms_de_4)
row.names(top_terms_de) <- NULL
kableExtra::kable_paper(knitr::kable(top_terms_de, format = "html", digits = 50))| term_id | source | term_name | p_value |
|---|---|---|---|
| GO:0002376 | GO:BP | immune system process | 9.871125e-28 |
| GO:0023052 | GO:BP | signaling | 2.926590e-22 |
| KEGG:04640 | KEGG | Hematopoietic cell lineage | 1.652865e-05 |
| KEGG:05310 | KEGG | Asthma | 1.142178e-04 |
| REAC:R-HSA-168256 | REAC | Immune System | 7.353197e-15 |
| REAC:R-HSA-6798695 | REAC | Neutrophil degranulation | 6.701769e-09 |
| WP:WP4884 | WP | Pathogenesis of SARS-CoV-2 mediated by nsp9-nsp10 complex | 4.206349e-03 |
| WP:WP2882 | WP | Nuclear receptors meta-pathway | 2.337374e-02 |
Table 7. Top two pathways (ordered by ascending p-values) returned by each of the four data sources in differentially expressed genes.
gp_upreg <- gprofiler2::gost(query = rownames(upreg_genes), organism = "hsapiens",
sources = c("GO:BP", "KEGG", "REAC", "WP"), significant = TRUE, user_threshold = 0.05,
correction_method = "fdr", exclude_iea = TRUE)Interactive Manhattan Plot of g:Profiler Enrichment Analysis of All Significant Up-Regulated Genes
Figure 10. Interactive Manhattan plot of g:Profiler output of all significant up-regulated genes, thresholded based on p-value (FDR) < 0.05 and logFC>log(2). The enrichment analysis is done from 4 data sources: GO:BP, KEGG, REAC, WP, as labeled by color.
gp_upreg_result <- gp_upreg$result[order(gp_upreg$result$p_value, decreasing = FALSE),
]
gp_upreg_result <- subset(gp_upreg_result, select = c("term_id", "source", "term_name",
"p_value"))
top_terms_upreg_1 <- gp_upreg_result[which(gp_upreg_result$source == "GO:BP"), ][top_n,
]
top_terms_upreg_2 <- gp_upreg_result[which(gp_upreg_result$source == "KEGG"), ][top_n,
]
top_terms_upreg_3 <- gp_upreg_result[which(gp_upreg_result$source == "REAC"), ][top_n,
]
top_terms_upreg_4 <- gp_upreg_result[which(gp_upreg_result$source == "WP"), ][top_n,
]
top_terms_upreg <- rbind(top_terms_upreg_1, top_terms_upreg_2, top_terms_upreg_3,
top_terms_upreg_4)
row.names(top_terms_upreg) <- NULL
kableExtra::kable_paper(knitr::kable(top_terms_upreg, format = "html", digits = 50))| term_id | source | term_name | p_value |
|---|---|---|---|
| GO:0006954 | GO:BP | inflammatory response | 1.290966e-08 |
| GO:0016477 | GO:BP | cell migration | 3.190944e-07 |
| KEGG:01100 | KEGG | Metabolic pathways | 1.325951e-05 |
| KEGG:00520 | KEGG | Amino sugar and nucleotide sugar metabolism | 2.387054e-02 |
| REAC:R-HSA-6798695 | REAC | Neutrophil degranulation | 7.400413e-25 |
| REAC:R-HSA-168249 | REAC | Innate Immune System | 1.353385e-15 |
| WP:WP2882 | WP | Nuclear receptors meta-pathway | 3.994737e-04 |
| WP:WP2880 | WP | Glucocorticoid receptor pathway | 2.913073e-03 |
Table 8. Top two pathways (ordered by ascending p-values) returned by each of the four data sources in up-regulated genes.
gp_downreg <- gprofiler2::gost(query = rownames(downreg_genes), organism = "hsapiens",
sources = c("GO:BP", "KEGG", "REAC", "WP"), significant = TRUE, user_threshold = 0.05,
correction_method = "fdr", exclude_iea = TRUE)Interactive Manhattan Plot of g:Profiler Enrichment Analysis of All Significant Down-Regulated Genes
Figure 11. Interactive Manhattan plot of g:Profiler output of all significant down-regulated genes, thresholded based on p-value (FDR) < 0.05 and logFC < -log(2). The enrichment analysis is done from 4 data sources: GO:BP, KEGG, REAC, WP, as labeled by color.
gp_downreg_result <- gp_downreg$result[order(gp_downreg$result$p_value, decreasing = FALSE),
]
gp_downreg_result <- subset(gp_downreg_result, select = c("term_id", "source", "term_name",
"p_value"))
top_terms_downreg_1 <- gp_downreg_result[which(gp_downreg_result$source == "GO:BP"),
][top_n, ]
top_terms_downreg_2 <- gp_downreg_result[which(gp_downreg_result$source == "KEGG"),
][top_n, ]
top_terms_downreg_3 <- gp_downreg_result[which(gp_downreg_result$source == "REAC"),
][top_n, ]
top_terms_downreg_4 <- gp_downreg_result[which(gp_downreg_result$source == "WP"),
][top_n, ]
top_terms_downreg <- rbind(top_terms_downreg_1, top_terms_downreg_2, top_terms_downreg_3,
top_terms_downreg_4)
row.names(top_terms_downreg) <- NULL
kableExtra::kable_paper(knitr::kable(top_terms_downreg, format = "html", digits = 30))| term_id | source | term_name | p_value |
|---|---|---|---|
| GO:0023052 | GO:BP | signaling | 1.203247e-22 |
| GO:0007154 | GO:BP | cell communication | 2.299139e-22 |
| KEGG:04640 | KEGG | Hematopoietic cell lineage | 2.063920e-07 |
| KEGG:05310 | KEGG | Asthma | 3.664767e-05 |
| REAC:R-HSA-1280215 | REAC | Cytokine Signaling in Immune system | 7.253035e-08 |
| REAC:R-HSA-162582 | REAC | Signal Transduction | 6.763319e-06 |
| WP:WP4884 | WP | Pathogenesis of SARS-CoV-2 mediated by nsp9-nsp10 complex | 2.876630e-04 |
| WP:WP69 | WP | T-cell receptor signaling pathway | 4.961292e-03 |
Table 9. Top two pathways (ordered by ascending p-values) returned by each of the four data sources in down-regulated genes.
1. Which method did you choose and why?
I chose to use g:Profiler via the gprofiler2 package. I chose g:Profiler because it is an up-to-date tool for functional enrichment analysis, with many of the mainstream data source readily available, allowing for comparisons between sources. I use the gprofiler2 package to submit queries directly from R, instead of manually submitting the gene list using the web tool and copy-pasting the results. This also allows for extensibility in my codes, so that the outputs are automatically updated whenever I make a change to the up-stream model. The thresholding is done via the fisher exact test and corrected using the Benjamini-Hochberg FDR procedure, which, as explained in the previous section (Differential Expression Questions), are mainstream choices for thresholding significant findings.
2. What annotation data did you use and why? What version of the annotation are you using?
I used GO Biological Pathway (GO:BP), Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome (REAC), and Wikipathways (WK). All four of them are top-choice data sources that are being regularly updated. Out of the four, I have experience with GO:BP, REAC, and WK in the g:Profiler homework exercise, and I also added KEGG because it is another up-to-date source on biological pathways and it’s readily available in g:Profiler. I used 4 sources and extracted the top two terms for each source to avoid potential biases that may exist in the data sources.
All data sources are in the latest version that is available in the g:Profiler:
| Version | |
|---|---|
| GO:BP | annotations: BioMart classes: releases/2022-12-04 |
| KEGG | KEGG FTP Release 2022-12-26 |
| REAC | annotations: BioMart classes: 2022-12-28 |
| WP | 20221210 |
Table 10. Version information of the 4 data sources used in g:Profiler enrichment analysis.
3. How many genesets were returned with what thresholds?
We set the threshold to be corrected p-value (FDR) < 0.05 and a
fold change > 2 (abs(logFC) > log(2)). This threshold
is higher than the example in class since I am working with a
substantial set of genes and many of them have very high fold changes,
as shown in the Model #1 Volcano
Plot. I wish to extract the major biological pathways that were
impacted by the differential expressions (as a result of the
dexamethasone treatment), without the minor pathways convoluting the
outputs.
The thresholded dataset contains 1443 genes (both up-regulated and down-regulated), and returned a total of 834 significant (p-value (FDR) < 0.05) genesets out of the 4 data sources.
4. Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?
summary_df <- data.frame(top_terms_de$source, top_terms_de$term_name, formatC(top_terms_de$p_value,
format = "e", digits = 2), top_terms_upreg$term_name, formatC(top_terms_upreg$p_value,
format = "e", digits = 2), top_terms_downreg$term_name, formatC(top_terms_downreg$p_value,
format = "e", digits = 2))
colnames(summary_df) <- c("data_source", "top_terms_whole", "p_val_whole", "top_terms_up-reg",
"p_val_up-reg", "top_terms_down-reg", "p_val_down-reg")
knitr::kable(summary_df, format = "html", digits = 30) %>%
kableExtra::kable_paper()| data_source | top_terms_whole | p_val_whole | top_terms_up-reg | p_val_up-reg | top_terms_down-reg | p_val_down-reg |
|---|---|---|---|---|---|---|
| GO:BP | immune system process | 9.87e-28 | inflammatory response | 1.29e-08 | signaling | 1.20e-22 |
| GO:BP | signaling | 2.93e-22 | cell migration | 3.19e-07 | cell communication | 2.30e-22 |
| KEGG | Hematopoietic cell lineage | 1.65e-05 | Metabolic pathways | 1.33e-05 | Hematopoietic cell lineage | 2.06e-07 |
| KEGG | Asthma | 1.14e-04 | Amino sugar and nucleotide sugar metabolism | 2.39e-02 | Asthma | 3.66e-05 |
| REAC | Immune System | 7.35e-15 | Neutrophil degranulation | 7.40e-25 | Cytokine Signaling in Immune system | 7.25e-08 |
| REAC | Neutrophil degranulation | 6.70e-09 | Innate Immune System | 1.35e-15 | Signal Transduction | 6.76e-06 |
| WP | Pathogenesis of SARS-CoV-2 mediated by nsp9-nsp10 complex | 4.21e-03 | Nuclear receptors meta-pathway | 3.99e-04 | Pathogenesis of SARS-CoV-2 mediated by nsp9-nsp10 complex | 2.88e-04 |
| WP | Nuclear receptors meta-pathway | 2.34e-02 | Glucocorticoid receptor pathway | 2.91e-03 | T-cell receptor signaling pathway | 4.96e-03 |
Table 11. Top two terms returned by each of the four data sources for each of the three sets of genes (all differentially expressed, up-regulated, down-regulated). This table is a merge of Table 7, 8, and 9.
Up-regulated: contains 568 genes and returned a total of 371 significant (p-value (FDR) < 0.05) genesets out of the 4 data sources.
Down-regulated: contains 875 genes and returned a total of 595 significant (p-value (FDR) < 0.05) genesets out of the 4 data sources.
From Table 11, we can see that there is a major theme among the top terms that are returned: immune system process, inflammatory response, Cytokine Signaling, and Neutrophil degranulation are all pathways related to the immune system. Other smaller categories include viral infections and hematopoietic cell lineage.
Comparing between results obtained from the whole list versus those obtained from the up-regulated or down-regulated set of genes only, we can see that most of the top terms are related (e.g. falls under the same category), if not the same.
Background and pre-processing of the data set can be found here: Background on the Data Set
Questions from previous sections can be found here:
1. Do the over-representation results support conclusions or mechanism discussed in the original paper?
Yes, the over-representation results do support some the mechanisms discussed in the original paper. Aside from the obvious pathways like “Glucocorticoid receptor pathway” (WP:WP2880) and “Hematopoietic cell lineage” (KEGG:04640), which directly relates to the background of this data set (analyzing the effect of glucocorticoids (dexamethasone) in human hematopoietic stem cells), there are several additional results that are supported. In the original paper, the authors were investigating the effect of glucocorticoids (GC) on helper innate lymphoid cells (hILCs). hILCs are characterized by the high level of cytokines that they produce, and the authors concluded that GC suppresses hILC development and activity by observing a lower level of cytokines in the presence of GC. [5] In our enrichment analyses result, “Cytokine Signaling in Immune system” (REAC:R-HSA-1280215) is the top term returned by Reactome for the set of down-regulated genes, which matches with the author’s findings.
Other than cytokine signaling, the paper also mentioned that GC has anti-inflammatory properties, which is supported by “inflammatory response” (GO:0006954) in our result for the up-regulated genes, as well as a few other more general pathways related to immune system responses.
2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.
I searched for more details on the mechanisms of GC on suppressing immune response to provide explanations for some of my results:
GC can inhibit the presentation of antigens to T cells by reducing the expression of major histocompatibility complex molecules on the surface of antigen-presenting cells, which in turn inhibits T-cell receptor signaling. [19] This supports my enrichment analysis result where “T-cell receptor signaling pathway” (WP:WP69) is one of the top-hits in the down-regulated genes.
GC is one of the most effective treatment of asthma, which a chronic inflammatory disease of the airways. GC treats asthma symptoms by promoting anti-inflammatory genes while switching off inflammatory gene expression and inhibiting inflammatory cell. [20] This supports my enrichment analysis result where “Asthma” (KEGG:05310) is one of the top-hits in both the down-regulated genes and the entire set of differentially expressed genes.
Highly potent GC (dexamethasone) was found to be an effective treatment for patients who are severely affected with the COVID-19 virus. [21] This provides some evidence for my enrichment analysis result where “Pathogenesis of SARS-CoV-2 mediated by nsp9-nsp10 complex” (WP:WP4884) is one of the top-hits in both the down-regulated genes and the entire set of differentially expressed genes. It is possible that GC regulates the viral replication process via the nsp9-nsp10 complex.